Unigram Language Model for Chinese Word Segmentation

Authors

  • Aitao Chen
  • Yiping Zhou
  • Anne Zhang
  • Gordon Sun
Abstract

This paper describes a Chinese word segmentation system that uses a unigram language model to resolve segmentation ambiguities. The system is augmented with a set of pre-processors and post-processors to extract new words in
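To illustrate how a unigram language model can resolve a segmentation ambiguity, here is a minimal sketch, not the authors' system. It assumes the usual maximum-probability formulation: among all ways of splitting a string into words, pick the one whose product of word probabilities is highest, found by dynamic programming. The function name `segment`, the parameters `max_word_len` and `unk_prob`, and the toy probabilities are all hypothetical and chosen only for this example.

```python
import math


def segment(text, unigram_probs, max_word_len=4, unk_prob=1e-8):
    """Return the split of `text` with the highest unigram probability.

    `unigram_probs` maps known words to probabilities (toy values here).
    `unk_prob` is an assumed fallback for unknown single characters, so
    that every input string has at least one valid segmentation.
    """
    n = len(text)
    # best[i] = (log-probability, start index of the last word)
    # for the best segmentation of text[:i].
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in unigram_probs:
                p = unigram_probs[word]
            elif len(word) == 1:
                p = unk_prob
            else:
                continue  # unknown multi-character strings are not words
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk back through the stored start indices to recover the words.
    words = []
    i = n
    while i > 0:
        start = best[i][1]
        words.append(text[start:i])
        i = start
    return words[::-1]


# Toy probabilities (invented for illustration, not from the paper).
toy_probs = {
    "研究": 0.02, "研究生": 0.005, "生命": 0.01,
    "生": 0.002, "命": 0.001, "起源": 0.005,
}
result = segment("研究生命起源", toy_probs)
print(result)
```

With these toy values the string can be read as 研究 / 生命 / 起源 or as 研究生 / 命 / 起源; the first reading has the higher product of unigram probabilities, so it is the one returned. The new-word extraction performed by the pre-processors and post-processors mentioned in the abstract is not modeled here.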


Similar resources

Contextual Dependencies in Unsupervised Word Segmentation

Developing better methods for segmenting continuous text into words is important for improving the processing of Asian languages, and may shed light on how humans learn to segment speech. We propose two new Bayesian word segmentation methods that assume unigram and bigram models of word dependencies respectively. The bigram model greatly outperforms the unigram model (and previous probabilistic...


Chinese Unknown Word Identification Based on Local Bigram Model

This paper presents a Chinese unknown word identification system based on a local bigram model. Generally, our word segmentation system employs a statistical-based unigram model. But to identify those unknown words, we take advantage of their contextual information and apply a bigram model locally. By adjusting the value of interpolation which is derived from a smoothing method, we combine thes...


Chinese Unknown Word Identification Based on Local Bigram Model with Integrally Smoothing Assumption

The paper presents a Chinese unknown word identification system based on a local bigram model. Generally, our word segmentation system employs a statistical-based unigram model. But to identify those unknown words, we take advantage of their contextual information and apply a bigram model locally. To explain this local approximation, we make an “integrally smoothing assumption”. As a simplifica...


A Projection Extension Algorithm for Statistical Machine Translation

In this paper, we describe a phrase-based unigram model for statistical machine translation that uses a much simpler set of model parameters than similar phrase-based models. The units of translation are blocks – pairs of phrases. During decoding, we use a block unigram model and a word-based trigram language model. During training, the blocks are learned from source interval projections using a...


Closed-Set Chinese Word Segmentation Based on Convolutional Neural Network Model

This paper proposes a neural model for closed-set Chinese word segmentation. The model follows the character-based approach which assigns a class label to each character, indicating its relative position within the word it belongs to. To do so, it first constructs shallow representations of characters by fusing unigram and bigram information in limited context window via an element-wise maximum...




Publication date: 2005